This invention relates to a system and method for processing signals to aid
their classification and recognition. More specifically, the invention relates to
a modified process for training and using both Gaussian Mixture Models and
Hidden Markov Models to improve classification performance, particularly but
not exclusively with regard to speech.

Gaussian Mixture Models (GMMs) and Hidden Markov Models (HMMs) are
often used in signal classifiers to help identify an input signal when given a set
of example inputs, known as training data. Uses of the technique include
speech recognition, where the audio speech signal is digitised and input to the
classifier, and the classifier attempts to generate from its vocabulary of words
the set of words most likely to correspond to the input audio signal. Another
application is in radar, where radar signal returns from a scene are processed
to provide an estimate of the contents of the scene. Published International
specification WO02/08783 demonstrates the use of Hidden Markov Model
processing of radar signals.

Before a GMM or HMM can be used to classify a signal, it must be trained
with an appropriate set of training data to initialise parameters within the
model to provide the most efficient performance. There are thus two distinct
stages associated with practical use of these models, the training stage and
the classification stage. With both of these stages, data is presented to the
classifier in a similar manner. When applied to speech recognition, a set of
vectors representing the speech signal is typically generated in the following
manner. The incoming audio signal is digitised and divided into 10ms
segments. The frequency spectrum of each segment is then taken, with
windowing functions being employed if necessary to compensate for
truncation effects, to produce a spectral vector. Each element of the spectral
vector typically measures the logarithm of the integrated power within a
different frequency band. The audible frequency range is typically spanned by
around 25 such contiguous bands, but one element of the spectral vector is
conventionally reserved to measure the logarithm of the integrated power
across all frequency bands, i.e. the logarithm of the overall loudness of the
sound. Thus, each spectral vector conventionally has around 25+1=26
elements; in other words, the vector space is conventionally 26-dimensional.
These spectral vectors are time-ordered and constitute the input to the HMM
or GMM, as a spectrogram representation of the audio signal.
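The conventional log-power encoding described above can be sketched as follows. This is only an illustration: the function name, the placement of the overall-loudness element at the end of the vector, and the example band powers are assumptions, and the spectral analysis that produces the band powers is taken as given.

```python
import math

def log_power_vector(band_powers):
    """Conventional encoding: log of each band power, plus log of the total power."""
    if any(p <= 0.0 for p in band_powers):
        raise ValueError("band powers must be positive to take logarithms")
    total_power = sum(band_powers)
    # Appending the loudness element last is an assumption made for illustration.
    return [math.log(p) for p in band_powers] + [math.log(total_power)]

# Hypothetical frame with 25 frequency bands.
example_powers = [1.0 + 0.1 * k for k in range(25)]
example_vector = log_power_vector(example_powers)
```

With 25 bands the result has 25+1=26 elements, matching the dimensionality quoted in the text.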

Training both the GMM and the HMM involves establishing an optimised set of
parameters associated with the processes using training data, such that
optimal classification occurs when the model is subjected to unseen data.

A GMM is a model of the probability density function (PDF) of its input vectors
(e.g. spectral vectors) in their vector space, parameterised as a weighted sum
of Gaussian components, or classes. Available parameters for optimisation
are the means and covariance matrices for each class, and prior class
probabilities. The prior class probabilities are the weights of the weighted sum
of the classes. These adaptive parameters are typically optimised for a set of
training data by an adaptive, iterative, re-estimation procedure such as the
Expectation Maximisation (EM) or log-likelihood gradient ascent algorithms,
which are well known procedures for finding a set of values for all the adaptive
parameters that maximises the training-set average of the logarithm of the
model's likelihood function (log-likelihood). These iterative procedures refine
the values of the adaptive parameters from one iteration to the next, starting
from initial estimates, which may just be random numbers lying in sensible
ranges.
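For reference, the standard textbook form of the quantities described above can be written as follows. The symbols are assumptions introduced here for illustration: K classes with prior class probabilities \pi_k, class means \mu_k and covariance matrices \Sigma_k, and a training set of T vectors x_t.

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \,
  \mathcal{N}\!\left(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right),
\qquad \sum_{k=1}^{K} \pi_k = 1,
\qquad
\bar{\ell} = \frac{1}{T} \sum_{t=1}^{T} \log p(\mathbf{x}_t)
```

Here \bar{\ell} is the training-set average log-likelihood that the re-estimation procedures seek to maximise.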

Once the adaptive parameters of a GMM have been optimised, those trained
parameters may subsequently be used for identifying the most likely of the set
of alternative models for any observed spectral vector, i.e. for classification of
the spectral vector. The classification step involves the conventional
procedure for computing the likelihood that each component of the GMM
could have given rise to the observed spectral vector.
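A minimal sketch of this classification step is given below. For brevity it assumes every component shares one isotropic variance, which is a simplification and not a requirement of the conventional procedure; the function name and the example values are hypothetical.

```python
import math

def component_posteriors(x, means, variance, priors):
    """Posterior probability of each isotropic Gaussian component given vector x."""
    n = len(x)
    log_terms = []
    for mean, prior in zip(means, priors):
        sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
        log_density = -0.5 * (n * math.log(2.0 * math.pi * variance) + sq_dist / variance)
        log_terms.append(math.log(prior) + log_density)
    peak = max(log_terms)  # subtract the peak before exponentiating, for stability
    weights = [math.exp(t - peak) for t in log_terms]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical two-class model in a two-dimensional vector space.
means = [[1.0, 0.0], [0.0, 1.0]]
posteriors = component_posteriors([0.9, 0.1], means, variance=0.05, priors=[0.5, 0.5])
best_class = max(range(len(posteriors)), key=posteriors.__getitem__)
```

The observed vector is assigned to the component with the largest posterior, here the first one.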

Whereas a GMM is a model of the PDF of individual input vectors irrespective
of their mutual temporal correlations, a HMM is a model of the PDF of
time-ordered sequences of input vectors. The adaptive parameters of an ordinary
HMM are the observation probabilities (the PDF of input vectors given each
possible hidden state of the Markov chain) and the transition probabilities (the
set of probabilities that the Markov chain will make a transition between each
pair-wise combination of possible hidden states).

For the case of an ordinary GMM-based HMM, the observation probabilities
are parameterised as a weighted sum of Gaussian components ('classes'),
i.e. the observation probabilities are parameterised as GMMs. Thus, a
prescription for optimising the HMM's observation probabilities can be re-cast
as a prescription for optimising the associated GMM's class means,
covariance matrices and prior class probabilities.



Training, or optimisation, of the adaptive parameters of a HMM is done so as
to maximise the overall likelihood function of the model of the input signal,
such as a speech sequence. One common way of doing this is to use the
Baum-Welch re-estimation algorithm, which is a development of the technique
of expectation maximisation of the model's log-likelihood function, extended to
allow for the probabilistic dependence of the hidden states on their earlier
values in the speech sequence. A HMM is first initialised with initial, possibly
random, assumptions for the values of the transition and observation
probabilities.

For each one of a set of sequences of input training vectors, such as speech- 
sequences, the Baum-Welch forward-backward algorithm is applied, to 
deduce the probability that the HMM could have given rise to the observed 
sequence. On the basis of all these per-sequence model likelihoods, the 
Baum-Welch re-estimation formula updates the model's assumed values for 
the transition probabilities and the observation probabilities (i.e. the GMM 
class means, covariance matrices and prior class probabilities), so as to 
maximise the increase in the model's average log-likelihood. This process is 
iterated, using the Baum-Welch forward-backward algorithm to deduce 
revised model likelihoods for each training speech-sequence and, on the 
basis of these, using the Baum-Welch re-estimation formula to provide further 
updates to the adaptive parameters. 

Each iteration of the conventional Baum-Welch re-estimation procedure can
be broken down into five steps for every GMM-based HMM: (a) applying the
Baum-Welch forward-backward algorithm on every training speech-sequence,
(b) the determination of what the updated values of the GMM class means
should be for the next iteration, (c) the determination of what the updated
values of the GMM class covariance matrices should be for the next iteration,
(d) the determination of what the updated values of the GMM prior class
probabilities should be for the next iteration, and (e) the determination of what
the updated values of the HMM transition probabilities should be for the next
iteration. Thus, the Baum-Welch re-estimation procedure for optimising a
GMM-based HMM can be thought of as a generalisation of the EM algorithm
for optimising a GMM, but with the updated transition probabilities as an extra,
fourth output.

For certain applications, HMMs are employed that do not have their
observation probabilities parameterised as GMMs, but instead use lower level
HMMs. Thus, a hierarchy is formed that comprises at the top a "high level"
HMM, and at the bottom a GMM, with each layer having its observation
probabilities defined by the next stage down. This technique is common in
subword-unit based speech recognition systems, where the structure
comprises two nested levels of HMM, with the lowest one having GMM based
observation probabilities.

The procedure for optimising the observation probabilities of a high-level HMM
reduces to the conventional procedure for optimising both the transition
probabilities and the observation probabilities (i.e. the GMM parameters) of
the ordinary HMMs at the lower level, which is as described above. The
procedure for optimising the high-level HMM's transition probabilities is the
same as the conventional procedure for optimising ordinary HMMs' transition
probabilities, which is as described above.

HMMs can be stacked into multiple-level hierarchies in this way. The
procedure for optimising the observation probabilities at any level reduces to
the conventional procedure for optimising the transition probabilities at all
lower levels combined with the conventional procedure for optimising the
GMM parameters at the lowest level. The procedure for optimising the
transition probabilities at any level is the same as the conventional procedure
for optimising ordinary HMMs' transition probabilities. Thus, the procedure for
optimising hierarchical HMMs can be described in terms of recursive
application of the conventional procedures for optimising the transition and
observation probabilities of ordinary HMMs.

Once the HMM's adaptive parameters have been optimised, the trained HMM
may subsequently be used for identifying the most likely of a set of alternative
models of an observed sequence of input vectors (spectral vectors in the
case of speech classification). This process is conventionally achieved using
the Baum-Welch forward-backward algorithm, which computes the likelihood
of generating the observed sequence of input vectors from each of a set of
alternative HMMs with different optimised transition and observation
probabilities.
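The forward half of that computation can be sketched as below. This is an illustrative scaled forward pass, not the full forward-backward algorithm; the function name, argument layout and example numbers are assumptions, and the per-frame observation probabilities are taken as already evaluated (for a GMM-based HMM they would come from each state's GMM).

```python
import math

def forward_log_likelihood(initial, transition, emission_probs):
    """Log-likelihood of a sequence under an HMM, using a scaled forward pass.

    initial[i]           : probability of starting in hidden state i
    transition[i][j]     : probability of moving from state i to state j
    emission_probs[t][i] : observation probability of frame t given state i
    """
    n_states = len(initial)
    alpha = [initial[i] * emission_probs[0][i] for i in range(n_states)]
    log_like = 0.0
    for t in range(len(emission_probs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * transition[i][j] for i in range(n_states))
                * emission_probs[t][j]
                for j in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot generate this sequence
        log_like += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid numerical underflow
    return log_like

# Hypothetical two-state model scoring a two-frame sequence.
sticky = [[0.9, 0.1], [0.1, 0.9]]
frames = [[0.8, 0.2], [0.8, 0.2]]
log_like = forward_log_likelihood([0.5, 0.5], sticky, frames)
```

Classification then amounts to evaluating this quantity for each alternative HMM and choosing the model with the largest value.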

The classification methods described above have certain disadvantages. 
When optimising the observation probabilities of the GMMs, and hence of the 
HMMs that may be hierarchically above them, as well as the transition 
probabilities of the HMM, there is a tendency for the optimisation to get caught 
in local minima, which prevents the system from achieving optimal 
classification. This can often be attributed to a tendency for class likelihood- 
PDFs to become "tangled up" with one another if they are free to become too 
highly anisotropic. Also, regarding speech recogniser technology, current 
recognisers are poor at capturing subtle variations and intrinsic characteristics 
of real speech, such as the full, specific variability of speakers' vowels under 
very different speaking conditions. In particular, individual vowels occupy 
complex shapes in spectral vector space, and attempting to represent these 
shapes as Gaussian distributions, as is conventionally done, can lead to 
unfaithful representation of the speech sounds. 

According to the present invention there is provided a signal processing
system for processing a plurality of multi-element data encoding vectors, the
system:

- having means for deriving the data encoding vectors from input
signals;

- being arranged to process the data encoding vectors using a
Gaussian Mixture Model (GMM) based Hidden Markov Model (HMM), the
GMM based HMM having at least one class mean vector having multiple
elements;

- being arranged to process the elements of the class mean vector(s)
by an iterative optimisation procedure;

characterised in that the system is also arranged to scale the elements of the
class mean vector(s) during the optimisation procedure to provide for the
class mean vector(s) to have constant modulus at each iteration, and to
normalise the data encoding vectors input to the GMM based HMM.

A GMM-based HMM is a generalisation of a GMM such that the HMM has
observation probabilities parameterised as Gaussian PDFs or weighted sums
of Gaussian PDFs, i.e. as a GMM. The observation probabilities of a
GMM-based HMM are parameterised as a GMM, but the GMM-based HMM is not
itself a GMM. An input stage can be added to a GMM based HMM however,
where this input stage comprises a simple GMM. The log-likelihood of a
GMM-based HMM is the log-likelihood of an HMM whose observation
probabilities are constrained to be parameterised as GMMs; it is not the
log-likelihood of a GMM. Consequently, the optimisation procedure of a
GMM-based HMM is not the same as that of a GMM.

Preferably the moduli of the mean vectors of each of the GMMs are rescaled 
after each iteration of the optimisation procedure so that they are all of equal 
value. 
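A minimal sketch of this rescaling step is shown below, assuming it is applied to the freshly re-estimated class means at the end of each iteration. The function names, the choice of modulus 1 and the example values are illustrative assumptions.

```python
import math

def rescale_to_modulus(vector, modulus=1.0):
    """Rescale all elements by one factor so the vector has the given modulus."""
    length = math.sqrt(sum(v * v for v in vector))
    if length == 0.0:
        raise ValueError("cannot rescale a zero vector")
    return [modulus * v / length for v in vector]

def constrain_class_means(class_means, modulus=1.0):
    """Apply the constant-modulus constraint to every class mean vector."""
    return [rescale_to_modulus(mean, modulus) for mean in class_means]

# Hypothetical class means as they might emerge from one re-estimation step.
updated_means = [[3.0, 4.0], [0.2, 0.0]]
constrained_means = constrain_class_means(updated_means)
```

Only the length of each mean changes; its direction, which carries the information about spectral shape, is preserved.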

Most signal processing systems of the type discussed in this specification
incorporate a GMM that represents the probability density function of all data
encoding vectors in the training sequence. The constraint of limiting the
elements of the class mean vector to have constant modulus leads to
simplified processing of the GMMs making up the signal processing system,
as the class means of each GMM will lie on the surface of a hypersphere
having dimensionality (n -1), where n is the dimension of an individual vector.

Preferably each covariance matrix is constrained so as to be isotropic and
diagonal, and to have a variance constrained to be a constant value. This
eliminates the possibility of certain classes of severe local minima associated
with highly anisotropic Gaussian components, and so prevents such
sub-optimal configurations from forming during the training process. Note that a
covariance matrix that is so constrained may be regarded mathematically as
a scalar value, and hence a scalar value may be used to represent such a
covariance matrix.

Each GMM, and therefore each GMM based HMM, has a set of prior class
probabilities. Preferably the prior class probabilities associated with the GMM
are constrained to be equal, and to remain constant throughout the
optimisation procedure.

Prior art signal processing systems incorporating GMMs generally avoid
putting constraints on the model parameters; other than that covariance
matrices are on occasion constrained to be equal across classes,
requirements are rarely imposed on the class means, covariance matrices,
prior class probabilities and hidden-state transition probabilities other than that
their values are chosen to make the average log-likelihood as large as
possible.

Preferably, each data encoding vector that is also an input vector, derived
from the input signal during both training and classifying stages of using the
GMM, is constrained such that its elements $x_i$ are proportional to the square
roots of the integrated power within different frequency bands.

Advantageously, the elements of each such data encoding vector are scaled 
such that the squares of the elements of the vector sum to a constant value 
that is independent of the total power of the original input. 

Preferably each such data encoding vector is augmented with the addition of
one or more elements representing the overall power in the vector. The
scaling of the vector elements described above removes any indication of the
power, so the additional element(s) provide the only indication of the power, or
loudness, within the vector. Clearly, the computation of the value of the
elements representing power would need to be based on pre-scaled elements
of the vector.

Certain applications, notably subword-unit based models, advantageously
employ a HMM that uses as its observation probability a GMM constrained
according to the current invention, that likewise acts as the observation
probability for a further HMM. In this way, a hierarchy of HMMs can be built
up, in the manner of the prior art, but with the difference that the constraints
on the model parameters according to the current invention are applied at
each level of the hierarchy.

Advantageously, the hierarchy may incorporate two GMMs as two lower
levels, with a HMM at the highest level. The lowest level GMM provides
posterior probabilities as a data encoding vector to a second, higher level
GMM. This second GMM provides observation probabilities to a HMM at the
third level. This arrangement allows individual speech-sounds to be
represented in the spectral-vector space not as individual Gaussian ellipsoids,
as is conventional, but as assemblies of many smaller Gaussian hypercircles
tiling the unit hypersphere, offering the potential for more faithful
representation of highly complex-shaped speech-sounds, and thus improved
classification performance.

Note that in this specification the terms "input vector" and "spectral vector" are
used interchangeably in the context of providing an input to the lowest level of
the system hierarchy. The vector at this level may represent the actual power
spectrum of the input signal, and hence be spectral coefficients, or may
represent some modified form of the power spectrum. In practice, the input
vector will generally represent a power spectrum of a segment of a temporal
input signal, but this will not be the case for all applications. Further
processing of the temporal input signal is used in some applications, e.g.
cosine transform. A "data encoding vector" is, within this specification, any
vector that is used as an input to any level of the hierarchy, depending on the
context, i.e. any vector that is used as the direct input to the particular level of
the hierarchy being discussed in that context. A data encoding vector is thus
an input vector only when it represents the information entering the system at
the lowest level of the hierarchy.

Note also that normalising a vector is the process of rescaling all its elements
by the same factor, in order to achieve some criterion defined on the whole
vector of elements. What that factor is depends on the criterion chosen for
normalisation. A vector can generally be normalised by one of two useful
criteria; one is to normalise such that the elements sum to a constant after
normalisation, the other is to normalise such that the squares of the elements
sum to a constant after normalisation. By the first criterion, the rescaling factor
should be proportional to the reciprocal of the sum of the values of the
elements before normalisation. By the second criterion, the rescaling factor
should be proportional to the reciprocal of the square root of the sum of the
squares of the values of the elements before normalisation. A vector of
exclusive probabilities is an example of a vector normalised by the first
criterion, such that the sum of those probabilities is 1. A (real-valued) unit
vector is an example of a vector normalised according to the second criterion;
the sum of the squares of the elements of a (real-valued) unit vector is 1. A
vector whose elements comprise the square roots of a set of exclusive
probabilities is also an example of a vector normalised by the second criterion.
According to another aspect of the current invention there is provided a
method of processing a signal, the signal comprising a plurality of
multi-element data encoding vectors, wherein the data encoding vectors are
derived from an analogue or digital input, and where the method employs at
least one Gaussian Mixture Model (GMM) based Hidden Markov Model
(HMM), the GMM based HMM having at least one class mean vector having
multiple elements, and the elements of the class mean vector(s) are optimised
in an iterative procedure, characterised in that the elements of the class mean
vectors are scaled during the optimisation procedure such that the class mean
vectors have a constant modulus at each iteration, and the data encoding
vectors input to the GMM based HMM are processed such that they are
normalised.

Note that the user(s) of a system trained according to the method of the
current invention may be different to the user(s) who performed the training.
This is due to the distinction between the training and the classification modes
of the invention.

According to another aspect of the current invention there is provided a
computer program designed to run on a computer and arranged to implement
a signal processing system for processing one or more multi-element input
vectors, the system:

- having means for deriving the data encoding vectors from input
signals;

- being arranged to process the data encoding vectors using a
Gaussian Mixture Model (GMM) based Hidden Markov Model (HMM), the
GMM based HMM having at least one class mean vector having multiple
elements;

- being arranged to process the elements of the class mean vector(s)
by an iterative optimisation procedure;

characterised in that the system is also arranged to scale the elements of the
class mean vector(s) during the optimisation procedure to provide for the
class mean vector(s) to have constant modulus at each iteration, and to
normalise the data encoding vectors input to the GMM based HMM.

The present invention can be implemented on a conventional computer
system. A computer can be programmed so as to implement a signal
processing system according to the current invention to run on the computer
hardware.

According to another aspect of the current invention there is provided a
speech recogniser incorporating a signal processing system for processing
one or more multi-element input vectors, the recogniser:

- having means for deriving the data encoding vectors from input
signals;

- being arranged to process the data encoding vectors using a
Gaussian Mixture Model (GMM) based Hidden Markov Model (HMM), the
GMM based HMM having at least one class mean vector having multiple
elements;

- being arranged to process the elements of the class mean vector(s)
by an iterative optimisation procedure;

characterised in that the system is also arranged to scale the elements of the
class mean vector(s) during the optimisation procedure to provide for the
class mean vector(s) to have constant modulus at each iteration, and to
normalise the data encoding vectors input to the GMM based HMM.

A speech recogniser may advantageously incorporate a signal processing 
system as described herein, and may incorporate a method of signal 
processing as described herein. 

The current invention will now be described in more detail, by way of example
only, with reference to the accompanying Figures, of which:

Figure 1 diagrammatically illustrates a typical hardware arrangement suitable
for use with the current invention when implemented in a speech recogniser;

Figure 2 shows in block diagrammatic form the conventional re-estimation
procedure adopted by the prior art systems employing GMM or HMM based
classifiers;

Figure 3 shows in block diagrammatic form one of the pre-processing stages
carried out on input vectors based on frames of speech, relating to the frame's
spectral shape;

Figure 4 shows in block diagrammatic form a further pre-processing stage
carried out on the input vectors relating to the overall loudness of a frame of
speech;

Figure 5 shows in block diagrammatic form the modified re-estimation
procedure of GMMs or ordinary, or hierarchical HMMs as per the current
invention;

Figure 6 shows in more detail the class mean re-scaling constraint shown in
Figure 5;

Figure 7 shows in block diagrammatic form the implementation of a complete
system; and

Figure 8 shows graphically one advantage of the current invention using the
example of a simplified three dimensional input vector space.

The current invention would typically be implemented on a computer system
having some sort of analogue input, an analogue to digital converter, and
digital processing means. The digital processing means would comprise a
digital store and a processor. As shown in Figure 1, a speech recogniser
embodiment typically has a microphone 1 acting as a transducer from the
speech itself, the electrical output of which is fed to an analogue to digital
converter (ADC) 2. There may also be some analogue processing before the
ADC (not shown). The ADC feeds its output to a circuit 3 that divides the
digital signal into 10ms slices, and carries out a spectral analysis on each
slice, to produce a spectral vector. These spectral vectors are then fed into
the signal processor 4, in which is implemented the current invention. The
signal processor 4 will have associated with it a digital storage 5. Some
applications may have as an input a signal that has been digitised at some
remote point, and so would not have the ADC. Other hardware configurations
are also possible within the scope of the current invention.

A typical signal processing system of the current invention will comprise a
simple GMM and a GMM-based HMM, together used to classify an input 
signal. Before either of those models can be used for classification purposes, 
they must first be optimised, or trained, using a set of training data. There are 
thus two distinct modes of operation of a classification model: the training 
phase, and the classification phase. 

Figure 2 shows generically the steps used by prior art systems in training both
a GMM and a HMM based classifier. Figure 2 depicts the optimisation of
hierarchical GMM-based HMMs as well as the optimisation of ordinary
GMM-based HMMs and simple GMMs, because the steps relating to initialising and
re-estimating HMM transition probabilities relate to the initialisation and
re-estimation of HMM transition probabilities at all levels of the hierarchy.

The flow chart is entered from the top when it is required to establish an
improved set of parameters in the model to improve the classification
performance. First the various classes need to be initialised, these being the
class means, class covariance matrices and prior class probabilities. HMMs
have the additional step of initialising the transition probabilities. These
initialisation values may be random, or they may be a "best guess" resulting
either from some previous estimation procedure or from some other method.

These initialisations form the adaptive parameters for the first iteration of the
training procedure, which proceeds as follows. A data encoding vector or
vector sequence (for the HMM case) from the training sequence is obtained,
and processed using a known re-estimation procedure. For GMMs the EM
algorithm is often used, and for HMMs the Baum-Welch re-estimation
procedure is commonplace. This is the inner loop of the re-estimation
procedure, and is carried out for all data encoding vectors in the training
sequence.

Following this, the information gained during the inner loop processing is used
to compute the new classes and, for the HMM case, the new transition
probabilities. Convergence of this new data is tested by comparing it with the
previous set or by judging whether the likelihood function has achieved a
stable minimum, and the process re-iterated if necessary using the newly
computed data as a starting point.

Moving to the current invention, one embodiment of the current invention
applied to speech recognition employs a modified spectral vector that is
pre-processed in a manner that is different from the conventional log-power
representation of the prior art. The spectral vector itself comprises a spectral
representation of a 10ms slice of speech, divided up into typically 25
frequency bins.

The objective of the first stage of the pre-processing is that elements $x_i$
($i = 1, \ldots, m$) of the n-dimensional ($m \le n$) spectral vector x should be
proportional to the square roots $\sqrt{P_i}$ of integrated power $P_i$ within different
frequency bands, rather than the conventional logarithms of integrated power
within different frequency bands. Further, elements $x_i$ ($i = 1, \ldots, m$) should be
scaled such that their squares sum to a constant, $A^2$, that is independent
of the total power integrated across all frequency bands within the frame
corresponding to that spectral vector. Thus, if the frame is sampled into
m frequency bands, m of the elements $x_i$ of the n-dimensional ($m \le n$) spectral
vector x should satisfy

$$x_i = A \sqrt{\frac{P_i}{\sum_{j=1}^{m} P_j}} \qquad \text{(Equation 1)}$$

which implies $\sum_{i=1}^{m} x_i^2 = A^2$.

The value of the constant A has no functional significance; all that matters is
that it doesn't change from one spectral vector to the next.

The advantage of this normalised square root power representation for
spectral vectors is that the degree of match of the shape of spectral vector $x_i$
($i = 1, \ldots, m$), compared with a class mean vector $w_i$ ($i = 1, \ldots, m$), is then
proportional to the scalar product $\sum_{i=1}^{m} x_i w_i$, irrespective of the modulus
(vector length) of the template. This provides the freedom to constrain the
modulus of the template without losing the functionality of being able to
determine the degree of match of the template by computing the scalar
product.
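The link between the scalar product and the distance used by an isotropic Gaussian class can be made explicit with a standard identity. The notation below is an assumption introduced for illustration: x is the spectral vector, w the class mean (template) and \sigma^2 the common variance of the isotropic classes.

```latex
|\mathbf{x} - \mathbf{w}|^2 = |\mathbf{x}|^2 + |\mathbf{w}|^2 - 2\,\mathbf{x}\cdot\mathbf{w},
\qquad
\log \mathcal{N}\!\left(\mathbf{x} \mid \mathbf{w}, \sigma^2 I\right)
  = \text{const} - \frac{|\mathbf{x}-\mathbf{w}|^2}{2\sigma^2}
  = \text{const}' + \frac{\mathbf{x}\cdot\mathbf{w}}{\sigma^2}
```

The last step holds when |x| and |w| are both held constant, in which case ranking classes by likelihood is the same as ranking them by scalar product.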



The steps involved in the novel encoding of spectral vectors are represented
in the flow diagram of Figure 3 and listed as follows (a-e). After (a) choosing a
value for the constant A to be used for all frames of speech, (b) the first step
to be applied for each individual frame of speech is the same as the
conventional process for conducting a spectral analysis in order to obtain
m values of the integrated power $P_i$ ($i = 1, \ldots, m$) within m different frequency
bands spanning the audible frequency range. Then, instead of taking the
logarithms of these power-values as is conventional in the prior art, (c) their
sum $\sum_{j=1}^{m} P_j$ and (d) their square roots $\sqrt{P_i}$ ($i = 1, \ldots, m$) are computed, and
(e) each square-root value $\sqrt{P_i}$ is then divided by the square root of the total
power, $\sqrt{\sum_{j=1}^{m} P_j}$ (and multiplied by a constant scaling factor A as desired), to
obtain elements $x_i$ ($i = 1, \ldots, m$) of the novel encoding of the spectral vector
defined by equation 1.
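Steps (c) to (e) can be sketched as follows, assuming Equation 1 has the form x_i = A sqrt(P_i / sum_j P_j), so that the squared elements sum to A squared. The function name and the example band powers are hypothetical, and the spectral analysis of step (b) is taken as given.

```python
import math

def encode_spectral_shape(band_powers, a=1.0):
    """Square-root power encoding, scaled so the squared elements sum to a**2."""
    if any(p < 0.0 for p in band_powers):
        raise ValueError("band powers must be non-negative")
    total_power = sum(band_powers)                 # step (c)
    if total_power == 0.0:
        raise ValueError("frame has no power; spectral shape is undefined")
    scale = a / math.sqrt(total_power)
    return [scale * math.sqrt(p) for p in band_powers]   # steps (d) and (e)

# The same spectral shape at two loudness levels (hypothetical values).
quiet = encode_spectral_shape([1.0, 4.0, 4.0])
loud = encode_spectral_shape([100.0, 400.0, 400.0])
```

Both frames encode to the same vector, which illustrates that this part of the encoding carries no loudness information.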

As a second part of the pre-processing of the spectral vectors, the vector is
also augmented with the addition of extra elements that represent the overall
loudness of the speech at that frame, i.e. the total power $\sum_{j=1}^{m} P_j$ integrated
across all frequency bands.

This is particularly useful in conjunction with the novel way of encoding
spectral shape defined by equation 1. This is because elements $x_i$ ($i = 1, \ldots, m$)
defined by equation 1 are clearly independent of the overall loudness $\sum_{j=1}^{m} P_j$,
and therefore encode no information about it, so those m elements need to be
augmented with additional information if the spectral vector is to convey
loudness information.

In the current embodiment, two extra elements x m +i and x^'are added to the 
spectral vector, beyond the m elements used to enc9de the spectral shape. 
30 Thus the spectral vector will have n - m+2 dimensions. These two elements 

depend on the overall loudness L = Y?j-\ P ) in the followin 9 way: 
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x. = B , ' , x- . = B-r = (Equation 2) 

where f() and g() are two (different) functions of the overall loudness L, and B is a constant. The significance of B is that the ratio B/A determines the relative contributions to the squared modulus $|\mathbf{x}|^2 = \mathbf{x}\cdot\mathbf{x} = \sum_{i=1}^{n} x_i^2$ made by the two subsets of elements (i = m+1, m+2) and (i = 1,...,m); the values of these contributions are clearly $B^2$ and $A^2$ respectively. The ratio B/A may therefore be used to control the relative importance assigned to overall loudness and spectral shape in the coding of spectral vectors; for example, choosing B = 0 assigns no importance to overall loudness, while choosing similar values of A and B assigns similar importance to both aspects of the speech. The value of $A^2 + B^2$ can be chosen to be 1 for simplicity, which will make the squared modulus $|\mathbf{x}|^2 = \mathbf{x}\cdot\mathbf{x} = \sum_{i=1}^{n} x_i^2 = A^2 + B^2$ equal to 1 for all spectral vectors regardless of their speech content.

The advantages of this novel representation of loudness are (a) that the moduli of all spectral vectors will have the same constant value regardless of overall loudness, which frees one to constrain the moduli of templates (class means) $\mathbf{w} = (w_1,\dots,w_n)$, as is proposed in the main claims, and (b) that the ratio B/A may be used to control the relative importance assigned to overall loudness and spectral shape in the coding of spectral vectors.

Possible choices for the functions f() and g() include

$$f(L) = \sin\!\left(\frac{\pi}{2}\,\frac{\log L - \log L_{\min}}{\log L_{\max} - \log L_{\min}}\right), \qquad g(L) = \cos\!\left(\frac{\pi}{2}\,\frac{\log L - \log L_{\min}}{\log L_{\max} - \log L_{\min}}\right)$$ (Equation 3)

where $L_{\min}$ and $L_{\max}$ are constants chosen to correspond to the quietest and loudest volumes (total integrated power) typically encountered in individual frames of speech.

Useful values for the pair of constants (A, B) are (1, 0), $(1/\sqrt{2}, 1/\sqrt{2})$ and (<y[%,Jf), which all satisfy $A^2 + B^2 = 1$.

Once functions f() and g() and constants B, $L_{\min}$ and $L_{\max}$, to be used for all frames of speech, have been chosen, the steps involved in the process required to incorporate the loudness encoding as described above are shown in Figure 4. The process involves (a) summing the integrated powers $P_i$ within m frequency ranges $i = 1,\dots,m$ for each frame of speech to obtain the overall loudness L for that frame of speech, (b) evaluating the two extra elements $x_{m+1}$ and $x_{m+2}$ for that frame of speech according to equation 2, and (c) for that frame of speech appending the two extra elements to the m elements obtained from the process of Figure 3 to obtain an n = m+2 dimensional spectral vector incorporating the novel encodings of spectral shape and loudness.
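The loudness steps (a) to (c) might be sketched as below. This is only an illustration under stated assumptions: f and g are taken to be the sine and cosine of a log-loudness angle scaled to lie between 0 and π/2, the loudness is clamped to the range between L_min and L_max, and the shape elements are computed as A·sqrt(P_i / L). The names `loudness_elements` and `encode_frame` and all numerical values are hypothetical.

```python
import math


def loudness_elements(L, L_min, L_max, B):
    """Two extra elements (x_{m+1}, x_{m+2}) following equation 2, for an assumed f and g."""
    L = min(max(L, L_min), L_max)           # clamping is an assumption of this sketch
    t = (math.log(L) - math.log(L_min)) / (math.log(L_max) - math.log(L_min))
    f = math.sin(0.5 * math.pi * t)         # assumed choice of f(L)
    g = math.cos(0.5 * math.pi * t)         # assumed choice of g(L)
    norm = math.hypot(f, g)                 # equals 1 for this f and g; kept to mirror equation 2
    return B * f / norm, B * g / norm


def encode_frame(powers, A, B, L_min, L_max):
    """Append the two loudness elements to the m shape elements, giving n = m + 2."""
    L = sum(powers)                                    # step (a): overall loudness
    shape = [A * math.sqrt(p / L) for p in powers]     # shape elements (assumed form)
    extra = loudness_elements(L, L_min, L_max, B)      # step (b)
    return shape + list(extra)                         # step (c)


A = B = 1.0 / math.sqrt(2.0)
x = encode_frame([4.0, 1.0, 0.25, 0.75], A, B, L_min=1e-3, L_max=1e3)
```

With $A^2 + B^2 = 1$, the resulting six-element vector has unit modulus whatever the loudness of the frame.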

The steps as shown in Figures 3 and 4 comprise the pre-processing of the 
spectral vectors according to the embodiment of the current invention. 

The input vectors pre-processed as described above are used when optimising the various parameters of the GMMs and GMM-based HMMs. The inner loop of the optimisation procedure, as described in relation to Figure 1 above, is done using conventional methods such as EM re-estimation and Baum-Welch re-estimation, respectively. Further novel stages are concerned with applying constraints to the parameters in between iterations of this inner loop.

Figure 5 shows the re-estimation procedure of the current invention, with 
additional processes present as compared to that shown in Figure 2. These 
additional processes relate to the initialisation of the classes before the 
iterative part of the procedure starts, and to the rescaling of the class means 
following each iteration to take into account the constraints to be imposed. 
Note that for the HMM case the transition probability processing is unchanged 
from the prior art. 

One of the constraints is concerned with the class mean vectors of the GMM or HMM. The constraint takes the form of re-scaling the set of n-dimensional vectors $\mathbf{w}_j = (w_{j1},\dots,w_{jn})$ which represent the class means.



This constraint is applied to all the class means, as soon as they have been re-estimated, every time they are re-estimated (by the EM or Baum-Welch re-estimation procedures for example), and also when they are first initialised (see Figure 5). These extra steps, illustrated in the flow diagram of Figure 5, are as follows. (a) By summing the squares of its elements and then taking the square root of the sum, the modulus $|\mathbf{w}_j|$ of each of the N re-estimated class means $\mathbf{w}_j$ is first computed as

$$|\mathbf{w}_j| = \sqrt{\sum_{i=1}^{n} w_{ji}^2}$$ (Equation 4)

for all N classes j = 1,...,N. (b) After computing the modulus $|\mathbf{w}_j|$ of each re-estimated class mean, all the elements of each class mean are divided by that corresponding modulus, i.e.

$$w_{ji} \rightarrow D\,\frac{w_{ji}}{|\mathbf{w}_j|}$$ for all elements i = 1,...,n of all GMM classes j = 1,...,N (Equation 5)

These steps have the effect of re-scaling all the class means $\mathbf{w}_j$ to constant modulus D until the next iteration of their re-estimation, after which they are re-scaled again to constant modulus D by applying these steps again, as depicted in Figure 5. The value of the constant D is preferably set equal to the modulus |x| of the data vectors x. (For example, for a GMM receiving input data having moduli $|\mathbf{x}| = \sqrt{A^2 + B^2}$, the value of D should be set equal to $\sqrt{A^2 + B^2}$.)
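The re-scaling of equations 4 and 5 can be illustrated with a short sketch. The function name `rescale_class_means`, the list-of-lists representation and the example values are hypothetical, and raising an error for a zero class mean is an assumption rather than something the specification addresses.

```python
import math


def rescale_class_means(means, D=1.0):
    """Re-scale every class mean to constant modulus D."""
    rescaled = []
    for w in means:
        modulus = math.sqrt(sum(v * v for v in w))      # Equation 4
        if modulus == 0.0:
            raise ValueError("a zero class mean cannot be re-scaled")
        rescaled.append([D * v / modulus for v in w])   # Equation 5
    return rescaled


means = [[3.0, 4.0], [0.5, 0.0], [-1.0, 1.0]]           # made-up class means
unit = rescale_class_means(means, D=1.0)
```

In a training loop this function would be called once after the class means are initialised and again after every re-estimation of the means.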



The advantages of re-scaling the class means to constant modulus are that this encourages speech recognition algorithms to adopt novel encodings of speech data that may improve speech classification performance (such as hierarchical sparse coding), and that it may reduce the vulnerability of speech recognition algorithms to becoming trapped in undesirable sub-optimal configurations ('local minima') during training. These advantages result from the fact that the dynamics of learning have simplified degrees of freedom because the class means are constrained to remain on a hypersphere (of radius D) as they adapt.

Re-scaling class means $\mathbf{w}_j$ to constant modulus is particularly appropriate in conjunction with scaling data vectors x to constant modulus. This is because the degree of match between a data vector x and a class mean $\mathbf{w}_j$ can then be determined purely by computing the scalar product $\mathbf{w}_j \cdot \mathbf{x}$.
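A brief worked identity may make this explicit. It assumes that the data vector and the class mean both have modulus D; the expansion itself is standard vector algebra rather than a formula quoted from the specification.

```latex
|\mathbf{x} - \mathbf{w}_j|^2
  = |\mathbf{x}|^2 + |\mathbf{w}_j|^2 - 2\,\mathbf{w}_j \cdot \mathbf{x}
  = 2D^2 - 2\,\mathbf{w}_j \cdot \mathbf{x}
```

Under that assumption the squared distance falls as the scalar product rises, so ranking class means by their scalar product with x is equivalent to ranking them by closeness to x.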

Further to the embodiment of the current invention, the covariance matrices $C_j$ of the Gaussian distributions that constitute the GMMs are constrained to be isotropic and of constrained variance V, i.e. they are not optimised according to the conventional re-estimation procedures for covariance matrices (such as the EM algorithm for GMMs and the Baum-Welch procedure for GMM-based HMMs), but are defined once and for all in terms of the isotropic Identity Matrix I and the constrained variance V by

$$C_j = V\,I$$ for all classes j = 1,...,N (Equation 6)

V is a free parameter chosen (for example by trial and error) to give the speech recognition system best classification performance; V must be greater than zero, as a covariance matrix has non-negative eigenvalues, and V is preferably significantly smaller than the value of $D^2$. The benefit of setting V much smaller than $D^2$ is that it leads to a sparse distribution of the first level simple GMM's posterior probabilities, which in the main embodiment feed the data encoding vector space of the GMM-based HMM at the second level. This is because each Gaussian component of the first level simple GMM will individually only span a small area on the spectral vector hypersphere.

This process for choosing covariance matrices involves the following steps: (a) choosing a value for the constant of proportionality V so as to optimise the classification performance, for example by trial and error, (b) setting all the diagonal elements of the class covariance matrices equal to V, and (c) setting all the off-diagonal elements of the class covariance matrices equal to zero. Thus, the covariance matrix according to this embodiment of the present invention is both isotropic and diagonal.
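One practical consequence of a covariance matrix of the form V·I is that the Gaussian log-density depends on the data only through the squared distance to the class mean. The sketch below illustrates this using the standard multivariate Gaussian formula specialised to that case; the function name `isotropic_log_density` and the example values are hypothetical.

```python
import math


def isotropic_log_density(x, w, V):
    """Log of a Gaussian density with mean w and covariance V * I."""
    if V <= 0.0:
        raise ValueError("V must be greater than zero")
    n = len(x)
    # With all diagonal elements equal to V and zero off-diagonal elements,
    # the quadratic form reduces to |x - w|^2 / V and the determinant to V**n.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, w))
    return -0.5 * n * math.log(2.0 * math.pi * V) - sq_dist / (2.0 * V)


x = [0.6, 0.8]          # made-up unit-modulus data vector
w = [1.0, 0.0]          # made-up unit-modulus class mean
value = isotropic_log_density(x, w, V=0.01)
```

No matrix inversion or determinant routine is needed, which is one reason the isotropic constraint is computationally convenient.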

Used in conjunction with the above techniques for constraining the moduli of data vectors x and class means $\mathbf{w}_j$, constraining the class covariances in this way gives the advantage of encouraging speech recognition algorithms to adopt novel encodings of speech data that may improve speech recognition performance (such as hierarchical sparse coding), and reducing the vulnerability of speech recognition algorithms to becoming trapped in undesirable sub-optimal configurations ('local minima') during training. Sparse coding results from representing individual speech-sounds as assemblies of many small isotropic Gaussian hypercircles tiling the unit hypersphere in the spectral-vector space, offering the potential for more faithful representation of highly complex-shaped speech-sounds than is permitted by representation as a single anisotropic ellipsoid, and thus improved classification performance.

Because this constraint does away with the need for the conventional unconstrained re-estimation of the covariance matrices, Figure 5's modified procedure for optimising GMMs does not involve re-estimation of covariance matrices as does the conventional procedure of Figure 2.

A further constraint imposed on this embodiment of the current invention relates to the choice of prior class probabilities. The N prior probabilities Pr(j) for the GMM classes j = 1,...,N may be constrained to be constants, i.e. not optimised according to the conventional re-estimation procedures for prior class probabilities (such as the EM algorithm for GMMs and the Baum-Welch procedure for GMM-based HMMs), but defined once and for all by the step of setting

$$\Pr(j) = 1/N$$ for all classes j = 1,...,N (Equation 7)

Used in conjunction with the above innovations for constraining the moduli of data vectors x, class means $\mathbf{w}_j$ and the covariance matrices $C_j$, constraining the prior class probabilities in this way gives the advantage of reducing the vulnerability of speech recognition algorithms to becoming trapped in undesirable sub-optimal configurations ('local minima') during training. Because this innovation does away with the need for the conventional unconstrained re-estimation of the prior class probabilities, Figure 5's modified procedure for optimising GMMs does not involve re-estimation of prior class probabilities as does the conventional procedure of Figure 2.

It will be understood by people skilled in the relevant arts that the constraints applied to a GMM or HMM as described above in the training phase of the model will equally need to be applied during the classifying phase of use of the models. If they were employed during training, the steps for encoding spectral shape and overall loudness according to the present invention as described above will need to be applied to every spectral vector of any new speech to be classified.

An implementation of the invention, which combines all of the constraints detailed above, is illustrated in Figure 6. This implementation uses conventional spectral analysis of each frame of speech, followed by novel steps described above to encode both spectral shape and overall loudness into each spectral vector and to scale every spectral vector's modulus to the constant value of 1. The parameters A and B are both set equal to $1/\sqrt{2}$ and D is set equal to 1.

Such unit-modulus spectral vectors are input to a GMM having a hundred Gaussian classes (N = 100), with class means all constrained to have moduli equal to 1, with class prior probabilities all constrained to have constant and equal values of 1/100, and covariance matrices constrained to be isotropic and to have constant variances (i.e. not re-estimated at each iteration according to a procedure such as the EM algorithm). A good choice for that constant variance V has been found to be 0.01, although other values could be chosen by trial and error so as to give best speech classification performance of the whole system; the right choice for V will lie between 0 and 1. For each spectral vector input to this GMM, posterior probabilities for the classes are computed in the conventional way.
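Under the constraints just described (equal priors and a shared covariance V·I), the conventional posterior computation reduces to a softmax over scaled negative squared distances, because the priors and the Gaussian normalising constant cancel. The following sketch illustrates this with three made-up class means instead of a hundred; the function name `posteriors` and the data are hypothetical.

```python
import math


def posteriors(x, means, V):
    """Posterior class probabilities for equal priors and covariance V * I."""
    scores = [-sum((a - b) ** 2 for a, b in zip(x, w)) / (2.0 * V) for w in means]
    top = max(scores)                          # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [wt / total for wt in weights]


means = [[1.0, 0.0], [0.0, 1.0], [math.sqrt(0.5), math.sqrt(0.5)]]   # unit-modulus means
x = [0.6, 0.8]                                                       # unit-modulus data vector
sharp = posteriors(x, means, V=0.01)
broad = posteriors(x, means, V=1.0)
```

With V = 0.01 nearly all of the probability falls on the nearest class mean, whereas with V = 1.0 it is spread across the classes, which illustrates the sparse posterior distribution that a small V is said to encourage.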

Each set of GMM posterior probabilities computed above for each spectral vector is used to compute a unit-modulus data-encoding vector for input to an ordinary GMM-based HMM by taking the square roots of those posterior probabilities.
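The reason the square roots give unit modulus is that the posterior probabilities sum to 1, so the squares of their square roots also sum to 1. A minimal sketch, with a hypothetical function name and made-up posterior values:

```python
import math


def encode_posteriors(posterior_probs):
    """Data-encoding vector whose elements are the square roots of the posteriors."""
    return [math.sqrt(p) for p in posterior_probs]


post = [0.7, 0.2, 0.1, 0.0]     # made-up posteriors for four classes, summing to 1
y = encode_posteriors(post)
modulus = math.sqrt(sum(v * v for v in y))
```

Here `modulus` equals 1 up to rounding, so the vector can be passed to the second-level model without further scaling.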

These unit-modulus data-encoding vectors are input to the HMM as observation vectors. The class means of the Gaussian mixture that constitutes the parameterisation of the HMM's observation probabilities are all constrained to have moduli equal to 1. The number N of Gaussian classes used to parameterise the HMM's observation probabilities is chosen by trial and error so as to give best speech classification performance of the whole system. The prior probabilities of those classes are then determined by that choice of N; they are all constrained and set equal to 1/N. The covariance matrices of those classes are all constrained to be isotropic and to have constant variances (i.e. not re-estimated unconstrained according to a procedure such as the EM algorithm). The choice of that constant variance V would be determined by trial and error so as to give best speech classification performance of the whole system; the right choice for V will lie between 0 and 1.

The preferred implementation of the invention can be operated in training mode and classification mode. In classification mode, the HMM is used to classify the input observation vectors according to a conventional HMM classification method (Baum-Welch forward-backward algorithm or Viterbi algorithm), subject to the modifications described above.

In training mode, (a) the GMM is optimised for the training set of unit-modulus spectral vectors (encoded as described above) according to a conventional procedure for optimising GMM class means (e.g. the EM re-estimation algorithm), subject to the innovative modifications to re-scale the GMM class means to have constant moduli equal to 1, and to omit the conventional steps for re-estimating the GMM class covariance matrices and prior class probabilities. (b) Once the GMM has been optimised, it is used as described above to compute a set of data-encoding vectors from the training set of speech spectral vectors. (c) This set of data-encoding vectors is then used for training the HMM according to a conventional procedure for optimising HMM class means (e.g. the Baum-Welch re-estimation procedure), subject to the innovative modifications to re-scale the HMM class means to have constant moduli equal to 1, and to omit the conventional steps for re-estimating the HMM class covariance matrices and prior class probabilities. No modification is made to the conventional steps for re-estimating HMM transition probabilities; the conventional Baum-Welch re-estimation procedure may be used for re-estimating HMM transition probabilities.
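Step (a) of the training mode might look roughly like the following sketch, which covers the GMM stage only and leaves out the HMM stage. It assumes EM updates for the class means alone, equal priors, a fixed covariance V·I, and class means initialised from randomly chosen training vectors; the initialisation strategy, the function names and the toy data are all assumptions of this sketch.

```python
import math
import random


def rescale(v, D=1.0):
    """Re-scale vector v to modulus D."""
    modulus = math.sqrt(sum(a * a for a in v))
    if modulus == 0.0:
        raise ValueError("cannot re-scale a zero vector")
    return [D * a / modulus for a in v]


def responsibilities(x, means, V):
    """Posterior class probabilities for equal priors and covariance V * I."""
    scores = [-sum((a - b) ** 2 for a, b in zip(x, w)) / (2.0 * V) for w in means]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [wt / total for wt in weights]


def train_constrained_gmm(data, N, V, iterations=20, D=1.0, seed=0):
    rng = random.Random(seed)
    # Initialise the class means (assumed: random training vectors), then re-scale them.
    means = [rescale(x, D) for x in rng.sample(data, N)]
    n = len(data[0])
    for _ in range(iterations):
        sums = [[0.0] * n for _ in range(N)]
        counts = [0.0] * N
        for x in data:                                # E-step
            r = responsibilities(x, means, V)
            for j in range(N):
                counts[j] += r[j]
                for i in range(n):
                    sums[j][i] += r[j] * x[i]
        for j in range(N):                            # M-step for the means only
            if counts[j] > 0.0:
                mean = [s / counts[j] for s in sums[j]]
                means[j] = rescale(mean, D)           # constraint applied after re-estimation
        # Covariances and priors are deliberately not re-estimated.
    return means


# Toy unit-modulus data clustered around two directions.
data = [rescale([1.0, 0.1 * k]) for k in range(-3, 4)]
data += [rescale([0.1 * k, 1.0]) for k in range(-3, 4)]
means = train_constrained_gmm(data, N=2, V=0.05)
```

Every returned class mean has modulus D, since the re-scaling is applied at initialisation and after each re-estimation.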

Figure 8 illustrates the advantage of employing the constraints of the current invention. This shows a spectral vector $\mathbf{x} = (x_1, x_2, x_3)$, where |x| = 1. Constraining the spectral vectors, e.g. 101, to have a constant modulus has the implication that the class means 102 will all lie on the surface of a hypersphere. In the case shown the hypersphere has two dimensions, and so is an ordinary 2-sphere 103 in an ordinary three-dimensional space. Constraining the covariance matrices to be isotropic and diagonal has the effect that the individual classes will project onto this hypersphere in the form of circles 104. This arrangement allows individual speech-sounds to be represented in the spectral-vector space not as individual Gaussian ellipsoids, as is conventional, but as assemblies 105 of many smaller Gaussian hypercircles 104 tiling the unit hypersphere 103, offering the potential for more faithful representation of highly complex-shaped speech-sounds, and thus improved classification performance. Each class (hypercircle), e.g. 104, will span just a small area within the complex shape that delimits the set of all spectral vectors (which must all lie on the spectral-vector hypersphere 103) that could correspond to alternative pronunciations of a particular individual speech-sound; collectively, many such classes 104 will be able to span that whole complex shape much more faithfully than could a single, anisotropic ellipsoid as is conventionally used to represent an individual speech sound. Other sets of Gaussian classes within the same mixture model will be able to span parts of other complex shapes on the spectral vector hypersphere, i.e. of other speech sounds. The posterior probability associated with each of these Gaussian classes (hypercircles) is a measure of how close the current spectral vector is (on the spectral-vector hypersphere) to the corresponding Gaussian class mean 102 (hypercircle centre). Learning which sets of classes correspond to which speech sounds, on the basis of all the temporal correlations between them that are present in the training speech sequences, is the function of the GMM-based HMM, whose inputs are fed from the set of all those posterior probabilities.

To use an analogy, a large number of hypercircles helps one to avoid local minima far better than would a small number of anisotropic ellipsoids, for effectively the same reason that a bunch of sticks gets tangled more easily than a tray of marbles. (In this analogy, minimising the total gravitational potential of the set of marbles plays the analogous role to maximising the model likelihood.) Similarly, one can map out highly complex shapes much more faithfully by using a lot of marbles than by using a few sticks.
The skilled person will be aware that other embodiments within the scope of the invention may be envisaged, and thus the invention should not be limited to the embodiments as herein described.



References

A. R. Webb, Statistical Pattern Recognition, Arnold (London), 1999.

B. H. Juang & L. R. Rabiner, Hidden Markov models for speech recognition, Technometrics 33(3), American Statistical Association, 1991.



Claims

1. A signal processing system for processing a plurality of multi-element data encoding vectors, the system:

- having means for deriving the data encoding vectors from input signals;

- being arranged to process the data encoding vectors using a Gaussian Mixture Model (GMM) based Hidden Markov Model (HMM), the GMM based HMM having at least one class mean vector having multiple elements;

- being arranged to process the elements of the class mean vector(s) by an iterative optimisation procedure;

characterised in that the system is also arranged to scale the elements of the class mean vector(s) during the optimisation procedure to provide for the class mean vector(s) to have constant modulus at each iteration, and to normalise the data encoding vectors input to the GMM based HMM.

2. A system as claimed in claim 1 wherein the GMM based HMM has a covariance matrix, the elements of which remain constrained during the optimisation procedure such that the matrix is isotropic and diagonal, and the value of the non zero diagonal elements remains constant throughout the optimisation procedure.

3. A system as claimed in claim 1 or claim 2 wherein prior class probabilities associated with the GMM based HMM are constrained to be equal, and to remain unchanged throughout the optimisation procedure.

4. A system as claimed in any of the above claims wherein when the elements of the data encoding vectors represent spectral coefficients, the normalisation of the data encoding vectors is such that the data encoding vectors have equal moduli.

5. A system as claimed in claim 4 wherein the modulus of each data encoding vector is independent of the overall spectral power in the vector.



6. A system as claimed in claim 4 or claim 5 wherein elements forming spectral coefficients of a data encoding vector are arranged to be individually proportional to the square root of the power in their corresponding spectral band divided by the square root of the overall power contained in spectral bands represented in the vector.

7. A system as claimed in any of claims 4 to 6 wherein the system is arranged to add at least one additional element to each data encoding vector, wherein the added element(s) encode the overall power contained in spectral bands represented in the vector.

8. A system as claimed in claim 7 wherein the system is arranged to add two elements to each data encoding vector to represent the overall power in spectral bands, these two elements arranged such that the sum of their squares is a constant across all data encoding vectors that represent the spectrum of the input signal.

9. A system as claimed in any of claims 1 to 8 wherein the GMM based HMM provides the observation probabilities for a higher level HMM.

10. A system as claimed in any of claims 1 to 9 wherein the derivation of the data encoding vectors from the input signal involves the use of a GMM, whereby this GMM provides the data encoding vectors to the GMM based HMM that comprise elements derived from the GMM's posterior probabilities.

11. A system as claimed in claim 10 wherein elements of the data encoding vectors input from the GMM to the GMM based HMM are proportional to the square root of posterior probabilities of the additional GMM.

12. A system as claimed in claim 10 wherein elements of the data encoding vectors input from the GMM to the GMM based HMM are proportional to posterior probabilities of the additional GMM.

13. A system as claimed in any of claims 9 to 12 wherein the constant values for the modulus of each of the class mean vectors may be different at each level.

14. A method of processing a signal, the signal comprising a plurality of multi-element data encoding vectors, wherein the data encoding vectors are derived from an analogue or digital input, and where the method employs at least one Gaussian Mixture Model (GMM) based Hidden Markov Model (HMM), the GMM based HMM having at least one class mean vector having multiple elements, and the elements of the class mean vector(s) are optimised in an iterative procedure, characterised in that the elements of the class mean vectors are scaled during the optimisation procedure such that the class mean vectors have a constant modulus at each iteration, and the data encoding vectors input to the GMM based HMM are processed such that they are normalised.

15. A method as claimed in claim 14 wherein a covariance matrix within the GMM based HMM has one or more elements, all of which are constrained during the optimisation procedure such that the matrix is isotropic and diagonal, and the value of its non zero elements remains constant throughout the optimisation procedure.

16. A method as claimed in claim 14 or claim 15 wherein prior class probabilities associated with the GMM based HMM are constrained to be equal, and to remain unchanged throughout the optimisation procedure.

17. A method as claimed in any of claims 14 to 16 wherein when the elements of the data encoding vectors represent spectral coefficients, the data encoding vectors are scaled in a pre-processing stage before being input to the GMM based HMM, such that the moduli of all data encoding vectors are equal.

18. A method as claimed in claim 17 wherein the modulus of each data encoding vector is independent of the overall power in the vector.

19. A method as claimed in claim 17 or claim 18 wherein elements forming spectral coefficients of a data encoding vector are arranged to be individually proportional to the square root of the power in their corresponding spectral band, divided by the square root of the overall power contained in spectral bands represented in the vector.

20. A method as claimed in any of claims 17 to 19 wherein at least one additional element is added to each data encoding vector, wherein the added element(s) encode the overall power contained in spectral bands represented in the vector.

21. A method as claimed in claim 20 wherein two elements are added to each data encoding vector to represent the overall power in spectral bands, these two elements arranged such that the sum of their squares is a constant across all input vectors that represent the spectrum of the input signal.

22. A method as claimed in any of claims 14 to 21 wherein the GMM based HMM provides the observation probabilities for a higher level HMM.

23. A method as claimed in any of claims 14 to 21 wherein the derivation of the data encoding vectors from the input signal involves the use of a GMM, whereby this GMM provides the data encoding vectors to the GMM based HMM that comprise elements derived from the GMM's posterior probabilities.

24. A method as claimed in claim 23 wherein elements of the data encoding vectors input from the GMM to the GMM based HMM are proportional to the square root of posterior probabilities of the additional GMM.

25. A method as claimed in claim 23 wherein elements of the data encoding vectors input from the GMM to the GMM based HMM are proportional to posterior probabilities of the additional GMM.

26. A method as claimed in any of claims 22 to 25 wherein the constant values for the modulus of each of the class mean vectors may be different at each level.

27. A signal processing system that has been trained according to the method as described in any of claims 14 to 26.

28. A computer program designed to run on a computer and arranged to implement a signal processing system for processing one or more multi-element input vectors, the system:

- having means for deriving the data encoding vectors from input signals;

- being arranged to process the data encoding vectors using a Gaussian Mixture Model (GMM) based Hidden Markov Model (HMM), the GMM based HMM having at least one class mean vector having multiple elements;

- being arranged to process the elements of the class mean vector(s) by an iterative optimisation procedure;

characterised in that the system is also arranged to scale the elements of the class mean vector(s) during the optimisation procedure to provide for the class mean vector(s) to have constant modulus at each iteration, and to normalise the data encoding vectors input to the GMM based HMM.

29. A speech recogniser incorporating a signal processing system for processing one or more multi-element input vectors, the recogniser:

- having means for deriving the data encoding vectors from input signals;

- being arranged to process the data encoding vectors using a Gaussian Mixture Model (GMM) based Hidden Markov Model (HMM), the GMM based HMM having at least one class mean vector having multiple elements;

- being arranged to process the elements of the class mean vector(s) by an iterative optimisation procedure;

characterised in that the system is also arranged to scale the elements of the class mean vector(s) during the optimisation procedure to provide for the class mean vector(s) to have constant modulus at each iteration, and to normalise the data encoding vectors input to the GMM based HMM.

Abstract

A signal processing system is disclosed which includes a Gaussian Mixture Model (GMM) based Hidden Markov Model (HMM), parameters of which are constrained during the optimisation procedure. Also disclosed is a constraint system applied to input vectors representing the input signal to the system. The invention is particularly, but not exclusively, related to speech recognition systems. The invention reduces the tendency, common in prior art systems, to get caught in local minima associated with highly anisotropic Gaussian components - which reduces the recogniser performance - by employing the constraint system as above whereby the anisotropy of such components may be minimised. The invention also covers a method of processing a signal, and a speech recogniser trained according to the method.
[Figure 3: flow diagram of the spectral shape encoding. Input: the m-dimensional vector each of whose elements i (i = 1,...,m) represents the integrated power $P_i$ within a different frequency range i of a frame of speech, obtained by conventional spectral analysis of the speech into m frequency ranges. Steps: compute the square root $\sqrt{P_i}$ of each element of the vector; compute the sum L of the elements of the vector; scale each square-root element using the constant A and the sum L. Output: the m-dimensional constant-modulus vector $(x_1,\dots,x_m)$ incorporating the novel "normalised square root power" encoding of the spectral shape of the frame of speech.]

[Figure 4: flow diagram of the loudness encoding. Inputs: the m-dimensional vector $(x_1,\dots,x_m)$ incorporating the novel "normalised square root power" encoding of the spectral shape of the frame of speech, computed as in Figure 3, and the loudness $L = \sum_{j=1}^{m} P_j$ of the frame of speech, computed as in Figure 3. Output: the n = m+2-dimensional constant-modulus spectral vector $(x_1,\dots,x_{m+2})$ incorporating the novel encodings of the spectral shape and overall loudness of the frame of speech.]

[Figure 5: flow diagram of the re-estimation procedure. Start re-estimation procedure; initialise all HMM transition probabilities (HMM cases: not applicable for GMM case); initialise class means; re-scale class means to constant moduli; initialise class covariance matrices as isotropic; initialise prior class probabilities equal. Iteration: adaptive parameters for next iteration of re-estimation procedure; obtain next spectral vector (GMM case) or sequence of spectral vectors (HMM cases); inner loop of conventional re-estimation procedure (e.g. EM for GMM case or Baum-Welch for HMM cases); decision with "no" and "yes" branches; compute new HMM transition probabilities (HMM case: not applicable for GMM case); conventional computation of the class means; re-scale class means to constant moduli; end re-estimation procedure.]

[Figure 6: flow diagram of the steps collectively referred to in Figure 5 as "Re-scale class means to constant moduli". Input: set of vectors each representing a class mean, after the step of initialising the class means or after each iteration of re-estimation of the class means. Steps: for each vector representing one of these class means, compute the square root of the sum of the squares of its elements; for each such vector, multiply its elements by that number which is the constant D divided by the square root of the sum of the squares of its elements computed above. Output: set of vectors each representing a class mean, all having constant moduli equal to D.]

[Figure 7: block diagram of the overall implementation, from input to output. Spectral analysis of frames of speech followed by normalised square-root power encoding according to Figure 3; encoding of loudness in two extra elements according to Figure 4; n = m+2-dimensional constant-modulus spectral vectors $(x_1,\dots,x_{m+2})$; computation of GMM posterior probabilities for each spectral vector using a GMM having constant-modulus class means and isotropic covariance matrices (during the training phase, the GMM class means may be re-estimated according to Figure 5); compute the square roots of the GMM posterior probabilities; constant-modulus data-encoding vectors for each spectral vector, comprised of the square roots of the GMM posterior probabilities; classification of sequences of data-encoding vectors using a GMM-based HMM having constant-modulus class means and isotropic covariance matrices (during the training phase, the HMM class means may be re-estimated according to Figure 5).]